Imagine that you have been tasked with exploring the crime rates in different districts in the city of Boston through the police dataset. The specific districts we will be looking at is B2, B3, and D14. Some stats about main communities within these districts are shown below:
Mattapan and Roxbury (Districts B2 and B3)
Middlesex (District D14)
The findings for this project will be useful in uncovering certain trends of crime throughout Boston and the communities within it. The informaiton here could determine where certain resources will be distributed within the Boston government for reducing crime. Good luck!
First things first, let's get some terminology straight.
.ipynb file. These are pretty special, also known as Jupyter notebooks. Jupyter notebooks have a few special properties that make it ideal for work with data:
print()x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
print(x) # Run this cell after running the one above, and again after running the one below
Answer to the Ultimate Question of Life, the Universe, and Everything
x = 42
Anything you can do in Python, you can do here!
def UltimateQuestion(computer_name):
return computer_name + ' is thinking...'
UltimateQuestion('Deep Thought')
'Deep Thought is thinking...'
We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, like missing values
We'll rename these as pd and np, just so its easier to refer to later on
import pandas as pd
import numpy as np
For this semester, we'll typically work with data in tabular format, the type you'd be used to in an excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma seperated values.
For example, a CSV file could look something like...
INCIDENT_NUMBER, OFFENSE_DESCRIPTION, DISTRICT, SHOOTING,
PLTEST005, BURGLARY - RESIDENTIAL, B2 , True
PLTEST003, INVESTIGATE PROPERTY, B2, False
PLTEST002, INVESTIGATE PROPERTY, B2, False
To import this, let's use the pd.read_csv() function:
url = 'https://raw.githubusercontent.com/dt3zjy/node/master/week-1/workshop/boston_crime.csv'
crime = pd.read_csv(url)
Here, we've saved the data to a dataframe object named crimes
type(crime)
pandas.core.frame.DataFrame
DataFrames contain our data in little "spreadsheet"-like structures. Whatever manipulations you can think of doing to the data, you can likely search how to do
Let's take a look at the data. We'll use the function .head() to read in the first 5 rows
crime.head()
| Unnamed: 0 | INCIDENT_NUMBER | OFFENSE_DESCRIPTION | DISTRICT | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | STREET | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | PLTEST005 | BURGLARY - RESIDENTIAL | B2 | True | 2021-10-13 00:00:00 | 2021 | 10 | Wednesday | 0 | SCHROEDER PLZ | 42.333679 | -71.091878 |
| 1 | 1 | PLTEST003 | INVESTIGATE PROPERTY | B2 | False | 2021-05-12 00:00:00 | 2021 | 5 | Wednesday | 0 | SCHROEDER PLZ | 42.333679 | -71.091878 |
| 2 | 2 | PLTEST002 | INVESTIGATE PROPERTY | B2 | False | 2021-05-12 00:00:00 | 2021 | 5 | Wednesday | 0 | SCHROEDER PLZ | 42.333679 | -71.091878 |
| 3 | 3 | 222000599 | FRAUD - CREDIT CARD / ATM FRAUD | B2 | False | 2021-12-29 22:40:00 | 2021 | 12 | Wednesday | 22 | WASHINGTON ST | 42.328663 | -71.085634 |
| 4 | 4 | 222000583 | LARCENY THEFT FROM MV - NON-ACCESSORY | B2 | False | 2021-12-28 12:00:00 | 2021 | 12 | Tuesday | 12 | WASHINGTON ST | 42.328663 | -71.085634 |
How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)
crime.shape
(21697, 13)
Let's try to understand our data a bit better.
crime.OFFENSE_DESCRIPTION.nunique()
107
crime.OFFENSE_DESCRIPTION.value_counts()
INVESTIGATE PERSON 2341
M/V - LEAVING SCENE - PROPERTY DAMAGE 1598
SICK ASSIST 1501
INVESTIGATE PROPERTY 1254
VANDALISM 982
...
OPERATING UNDER THE INFLUENCE (OUI) DRUGS 2
BREAKING AND ENTERING (B&E) MOTOR VEHICLE 2
EXPLOSIVES - TURNED IN OR FOUND 1
AIRCRAFT INCIDENTS 1
FIREARM/WEAPON - ACCIDENTAL INJURY / DEATH 1
Name: OFFENSE_DESCRIPTION, Length: 107, dtype: int64
Show the most recent crime by sorting the dataframe:
crime.sort_values(by='OCCURRED_ON_DATE', ascending=False)
| Unnamed: 0 | INCIDENT_NUMBER | OFFENSE_DESCRIPTION | DISTRICT | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | STREET | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 14 | 222000197 | PROPERTY - LOST THEN LOCATED | B3 | False | 2021-12-31 23:30:00 | 2021 | 12 | Friday | 23 | MENTON ST | 42.272983 | -71.078050 |
| 23 | 23 | 212095530 | INVESTIGATE PROPERTY | B2 | False | 2021-12-31 23:21:00 | 2021 | 12 | Friday | 23 | HUMBOLDT AVE | 42.314303 | -71.088566 |
| 17 | 17 | 222000005 | THREATS TO DO BODILY HARM | B2 | False | 2021-12-31 22:53:00 | 2021 | 12 | Friday | 22 | ABBOTSFORD ST | 42.314630 | -71.092615 |
| 24 | 24 | 212095529 | ASSAULT - AGGRAVATED | B2 | False | 2021-12-31 22:47:00 | 2021 | 12 | Friday | 22 | WARREN ST | 42.316966 | -71.082541 |
| 25 | 25 | 212095528 | ASSAULT - SIMPLE | B2 | False | 2021-12-31 22:45:00 | 2021 | 12 | Friday | 22 | MISSION PARK DR | 42.333783 | -71.109343 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18616 | 18616 | 212013148 | FRAUD - FALSE PRETENSE / SCHEME | B3 | False | 2021-01-01 00:00:00 | 2021 | 1 | Friday | 0 | BLUE HILL AVE | 42.284826 | -71.091374 |
| 16848 | 16848 | 212021001 | LARCENY ALL OTHERS | B2 | False | 2021-01-01 00:00:00 | 2021 | 1 | Friday | 0 | ALLERTON ST & MAGAZINE ST\nROXBURY MA 02119\n... | 42.327551 | -71.069162 |
| 8568 | 8568 | 212058122 | FRAUD - FALSE PRETENSE / SCHEME | B2 | False | 2021-01-01 00:00:00 | 2021 | 1 | Friday | 0 | WASHINGTON ST | 42.328663 | -71.085634 |
| 14259 | 14259 | 212032563 | FRAUD - FALSE PRETENSE / SCHEME | B3 | False | 2021-01-01 00:00:00 | 2021 | 1 | Friday | 0 | BLUE HILL AVE | 42.284826 | -71.091374 |
| 21066 | 21066 | 212002295 | LARCENY THEFT FROM BUILDING | B2 | False | 2021-01-01 00:00:00 | 2021 | 1 | Friday | 0 | BILODEAU RD | 42.328663 | -71.085634 |
21697 rows × 13 columns
Subsetting is a super helpful tool. We'll take a look at this more depth in next week, but for now, here are the basics:
We can filter rows from a dataframe based on some condition
WASHINGTON STcrime[crime.STREET == 'WASHINGTON ST']
| Unnamed: 0 | INCIDENT_NUMBER | OFFENSE_DESCRIPTION | DISTRICT | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | STREET | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 222000599 | FRAUD - CREDIT CARD / ATM FRAUD | B2 | False | 2021-12-29 22:40:00 | 2021 | 12 | Wednesday | 22 | WASHINGTON ST | 42.328663 | -71.085634 |
| 4 | 4 | 222000583 | LARCENY THEFT FROM MV - NON-ACCESSORY | B2 | False | 2021-12-28 12:00:00 | 2021 | 12 | Tuesday | 12 | WASHINGTON ST | 42.328663 | -71.085634 |
| 7 | 7 | 222000555 | PROPERTY - LOST/ MISSING | B2 | False | 2021-12-31 12:00:00 | 2021 | 12 | Friday | 12 | WASHINGTON ST | 42.328663 | -71.085634 |
| 8 | 8 | 222000554 | FRAUD - FALSE PRETENSE / SCHEME | D14 | False | 2021-12-30 00:00:00 | 2021 | 12 | Thursday | 0 | WASHINGTON ST | 42.349056 | -71.150498 |
| 9 | 9 | 222000511 | M/V ACCIDENT - OTHER | B2 | False | 2021-12-30 00:00:00 | 2021 | 12 | Thursday | 0 | WASHINGTON ST | 42.328663 | -71.085634 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21684 | 21684 | 212089764 | M/V - LEAVING SCENE - PROPERTY DAMAGE | B2 | False | 2021-12-08 09:25:00 | 2021 | 12 | Wednesday | 9 | WASHINGTON ST | 42.328663 | -71.085634 |
| 21688 | 21688 | 212092227 | M/V - LEAVING SCENE - PROPERTY DAMAGE | D14 | False | 2021-12-08 07:50:00 | 2021 | 12 | Wednesday | 7 | WASHINGTON ST | 42.349056 | -71.150498 |
| 21690 | 21690 | 212092221 | PROPERTY - LOST/ MISSING | B2 | False | 2021-12-16 18:00:00 | 2021 | 12 | Thursday | 18 | WASHINGTON ST | 42.328663 | -71.085634 |
| 21691 | 21691 | 212092219 | PROPERTY - LOST/ MISSING | D14 | False | 2021-11-12 00:00:00 | 2021 | 11 | Friday | 0 | WASHINGTON ST | 42.349056 | -71.150498 |
| 21695 | 21695 | 121094465 | PROPERTY - LOST/ MISSING | B2 | False | 2021-12-23 23:45:00 | 2021 | 12 | Thursday | 23 | WASHINGTON ST | 42.328663 | -71.085634 |
2633 rows × 13 columns
How would you show only crimes north of the Museum of Fine Arts in Boston (Lat > 42.3394)
Hint: Same way as matching if statements in python, mirroring the syntax above
crime[crime.Lat > 42.3394]
| Unnamed: 0 | INCIDENT_NUMBER | OFFENSE_DESCRIPTION | DISTRICT | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | STREET | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 8 | 222000554 | FRAUD - FALSE PRETENSE / SCHEME | D14 | False | 2021-12-30 00:00:00 | 2021 | 12 | Thursday | 0 | WASHINGTON ST | 42.349056 | -71.150498 |
| 12 | 12 | 222000337 | AUTO THEFT | D14 | False | 2021-07-14 10:22:00 | 2021 | 7 | Wednesday | 10 | WASHINGTON ST | 42.349056 | -71.150498 |
| 19 | 19 | 212929649 | ASSAULT - SIMPLE | D14 | False | 2021-03-31 20:10:00 | 2021 | 3 | Wednesday | 20 | NEVINS ST | 42.349286 | -71.146727 |
| 29 | 29 | 212095483 | M/V - LEAVING SCENE - PROPERTY DAMAGE | D14 | False | 2021-12-31 20:06:00 | 2021 | 12 | Friday | 20 | ALLSTON ST | 42.352111 | -71.135311 |
| 33 | 33 | 212095470 | LARCENY SHOPLIFTING | D14 | False | 2021-12-31 18:44:00 | 2021 | 12 | Friday | 18 | CAMBRIDGE ST | 42.353709 | -71.137039 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21668 | 21668 | 212089933 | M/V ACCIDENT - PERSONAL INJURY | D14 | False | 2021-12-08 15:21:00 | 2021 | 12 | Wednesday | 15 | KENRICK ST | 42.346779 | -71.170624 |
| 21676 | 21676 | 212089802 | FRAUD - CREDIT CARD / ATM FRAUD | D14 | False | 2021-12-07 13:00:00 | 2021 | 12 | Tuesday | 13 | WASHINGTON ST | 42.349056 | -71.150498 |
| 21677 | 21677 | 212089798 | INVESTIGATE PERSON | D14 | False | 2021-12-08 15:00:00 | 2021 | 12 | Wednesday | 15 | WESTERN AVE | 42.361538 | -71.145512 |
| 21688 | 21688 | 212092227 | M/V - LEAVING SCENE - PROPERTY DAMAGE | D14 | False | 2021-12-08 07:50:00 | 2021 | 12 | Wednesday | 7 | WASHINGTON ST | 42.349056 | -71.150498 |
| 21691 | 21691 | 212092219 | PROPERTY - LOST/ MISSING | D14 | False | 2021-11-12 00:00:00 | 2021 | 11 | Friday | 0 | WASHINGTON ST | 42.349056 | -71.150498 |
4703 rows × 13 columns
What is the percentage of crimes where a shooting occurred?
len(crime[crime['SHOOTING']==True])/len(crime)*100
2.401253629534037
First things first, let's import the package to help us visualize the data, plotly.
If this package isn't yet included, we can install it using !pip install plotly. More on this week 5.
import plotly.express as px
Note that we're using the sub package of the broader package, called plotly express. This simplifies a lot of the more difficult steps
Plotly express has a broad range of options to play with, let's take a look at the documentation.
Do a quick google search to pull up documentation for px.scatter OR run px.scatter? in a Jupyter cell
px.histogram?
Let's look at the top ten most frequent crimes
crime_sample = crime.sample(frac=.2)
crime_sample = crime_sample[crime_sample['OFFENSE_DESCRIPTION'].isin(crime_sample['OFFENSE_DESCRIPTION'].value_counts()[0:10].index)]
fig = px.histogram(crime_sample, x='OFFENSE_DESCRIPTION', color='DISTRICT', opacity=.8)
fig.show()
Look at the stacked bar graph. What does the data tell you?
Let's take this data into a geographic plot. These crimes happen at certain PLACES, so let's see if we can find a trend.
fig = px.scatter_mapbox(crime_sample[crime_sample['DISTRICT']!='B2'], lat='Lat', lon='Long', mapbox_style="stamen-terrain", zoom=10.5,
color='OFFENSE_DESCRIPTION', hover_name='STREET', hover_data=['STREET','SHOOTING'],
)
fig.show()